
feat: implement cronjob service and scheduler#229

Open
danielwpz wants to merge 11 commits into main from cronjob

Conversation

@danielwpz
Contributor

No description provided.

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request establishes a foundational cron job and task scheduling service within the application. It enables the definition of various background tasks, provides mechanisms to schedule them using AWS EventBridge Scheduler, and processes them asynchronously via an AWS SQS-backed worker. This system is crucial for automating recurring operational tasks, such as the newly introduced cleanup of instances belonging to canceled subscriptions, thereby improving resource management and system hygiene.

Highlights

  • New Task Management System: Introduced a robust system for defining, scheduling, and executing background tasks, including TaskScheduler and TaskExecutor traits, along with TaskId, ScheduleSpec, and TaskPayload enums.
  • AWS Integration for Scheduling and Queuing: Leveraged AWS EventBridge Scheduler for cron-like scheduling and AWS SQS for reliable task queuing and processing, providing a scalable and managed solution for background operations.
  • Dedicated Task Worker Binary: A new task_worker binary was added to serve as a dedicated service for consuming and processing tasks from SQS, ensuring separation of concerns and efficient handling of background jobs.
  • Automated Instance Cleanup Task: Implemented a cleanup_canceled_instances task to automatically identify and delete instances associated with long-canceled subscriptions, featuring configurable grace periods and dry-run capabilities.
  • Centralized Task Configuration: Added a TaskConfig struct to manage all task-related settings via environment variables, simplifying configuration and deployment of the new task system.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@claude

claude bot commented Mar 25, 2026

PR #229 Review: feat: implement cronjob service and scheduler

Overall the architecture is solid — clean separation between services/jobs (domain types), api/tasks (AWS adapters), and the task_worker binary. Good test coverage and safe defaults (disabled by default). A few critical issues need addressing before merge.

CRITICAL

1. Privacy violation — NoopTaskPayload.note logged at INFO

crates/api/src/bin/task_worker.rs line 774 logs payload.note at INFO level. Per CLAUDE.md, user-supplied or operator-supplied text must not be logged at INFO in production. The note field is arbitrary text. Either drop it from the log statement or move to debug level.

2. SQS message never deleted on parse failure — infinite redelivery loop

In process_message (tasks/mod.rs), parse_task_message(body)? can return early via ? before the delete_message call. A malformed or unrecognized message will never be deleted from SQS and will be redelivered indefinitely until the visibility timeout expires. This permanently blocks any worker slot processing that queue position. Fix: delete the message (or send to a DLQ) even when parsing fails.
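
One way to avoid the poison-message loop is to separate "parse failed" from "execution failed" and always consume messages that can never parse. A dependency-free sketch of that control flow, where `ParsedMessage`, `parse_task_message`, and the string payloads are hypothetical stand-ins for the PR's serde/SQS types:

```rust
// Sketch: malformed messages are consumed (deleted) instead of redelivered forever.
#[derive(Debug, PartialEq)]
enum ParsedMessage {
    Task(String),            // a recognized task payload
    Drop { reason: String }, // unparseable: consume it, never redeliver
}

fn parse_task_message(body: &str) -> ParsedMessage {
    // Hypothetical stand-in for serde_json deserialization of the SQS body.
    if let Some(task) = body.strip_prefix("task:") {
        ParsedMessage::Task(task.to_string())
    } else {
        ParsedMessage::Drop { reason: format!("unrecognized body: {body}") }
    }
}

/// `deleted` collects the bodies that would be removed from the queue.
fn process_message(body: &str, deleted: &mut Vec<String>) {
    match parse_task_message(body) {
        ParsedMessage::Task(task) => {
            // Execute the task, then delete on success.
            println!("executing {task}");
            deleted.push(body.to_string());
        }
        ParsedMessage::Drop { reason } => {
            // Delete (or dead-letter) first, so the message is never redelivered.
            eprintln!("dropping malformed message: {reason}");
            deleted.push(body.to_string());
        }
    }
}
```

The key property is that both arms reach the delete call; only a transient execution failure should leave the message in the queue for redelivery.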

3. list_instances hard-coded to 1000 — silently skips instances beyond that limit

task_worker.rs line 826 calls list_instances(user_id, 1000, 0). Any user with more than 1000 instances has the remainder silently skipped during cleanup. This needs inner pagination (matching the outer user loop), or at minimum a warning when instances.len() == 1000.
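
The inner pagination could mirror the outer user loop. A sketch with a mocked lister, where the `(limit, offset)` shape mirrors the `list_instances` call quoted above but everything else is hypothetical:

```rust
// Sketch: page through all instances instead of capping at the first batch.
fn list_instances(all: &[u32], limit: usize, offset: usize) -> (Vec<u32>, usize) {
    // Hypothetical stand-in for agent_service.list_instances(user_id, limit, offset):
    // returns one page plus the total count.
    let page = all.iter().skip(offset).take(limit).copied().collect();
    (page, all.len())
}

fn list_all_instances(all: &[u32], limit: usize) -> Vec<u32> {
    let mut out = Vec::new();
    let mut offset = 0;
    loop {
        let (page, total) = list_instances(all, limit, offset);
        let fetched = page.len();
        out.extend(page);
        // Stop once everything is fetched, or a short page signals the end.
        if out.len() >= total || fetched < limit {
            break;
        }
        offset += fetched;
    }
    out
}
```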

4. API server startup blocked by AWS scheduler — startup fails if AWS is unavailable

main.rs calls ensure_daily_cleanup_task synchronously and propagates its error with ?. If AWS EventBridge/SQS is unreachable at deploy time the entire API server refuses to start. This turns an AWS connectivity issue into a full service outage. Consider logging a warning and continuing, or running the check as a background task after the server is already listening.
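
The best-effort pattern suggested here is small: log and continue instead of propagating with `?`. A std-only sketch (the `ensure_daily_cleanup_task` name comes from the PR; the stub body and boolean flag are illustrative only):

```rust
// Sketch: scheduler registration at startup is best-effort, not fatal.
fn ensure_daily_cleanup_task(aws_reachable: bool) -> Result<(), String> {
    // Hypothetical stub for the real EventBridge Scheduler call.
    if aws_reachable { Ok(()) } else { Err("EventBridge unreachable".into()) }
}

/// Returns true if the server proceeds to listen, regardless of scheduler state.
fn start_server(aws_reachable: bool) -> bool {
    // Log a warning and keep going rather than propagating the error with `?`.
    if let Err(err) = ensure_daily_cleanup_task(aws_reachable) {
        eprintln!("warn: could not ensure cleanup schedule: {err}");
    }
    // ... bind listener and serve ...
    true
}
```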

5. Transient SQS errors kill the worker process

run_forever propagates receive_message errors immediately via ?. Any transient network blip terminates the binary. The worker should retry on transient errors (with backoff) rather than exiting.
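
A retry loop with capped exponential backoff is one way to keep the worker alive. A std-only sketch, where `receive_batch` is a hypothetical stand-in for the SQS `receive_message` call (here the first two attempts simulate transient failures):

```rust
use std::time::Duration;

// Hypothetical stand-in for sqs_client.receive_message().
fn receive_batch(attempt: u32) -> Result<Vec<String>, String> {
    if attempt < 2 { Err("timeout".into()) } else { Ok(vec!["task:noop".into()]) }
}

// Double the delay each failure, capped at 60 s.
fn next_backoff(current: Duration) -> Duration {
    (current * 2).min(Duration::from_secs(60))
}

fn receive_with_retry() -> Vec<String> {
    let mut delay = Duration::from_secs(1);
    let mut attempt = 0;
    loop {
        match receive_batch(attempt) {
            Ok(batch) => return batch,
            Err(err) => {
                // Log, sleep, and retry instead of terminating the binary.
                eprintln!("warn: receive failed ({err}), retrying in {delay:?}");
                // In the real async worker: tokio::time::sleep(delay).await;
                delay = next_backoff(delay);
                attempt += 1;
            }
        }
    }
}
```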


MINOR

6. ConflictException detection via string matching is fragile

tasks/mod.rs lines 1106 and 1154 detect already-exists errors by string-matching the error display. The AWS SDK exposes typed error variants — prefer those to avoid false positives.

7. upsert_task duplicates create_task_if_absent logic

Both methods share near-identical create + conflict-check + update code. Extract a shared helper to prevent future drift.

8. rust-toolchain.toml bump is a separate concern

The toolchain bump from 1.90.0 to 1.91.1 should be its own PR to isolate upgrade risk from the feature review.


⚠️ Issues found — the privacy logging violation (1) and the SQS poison-message infinite loop (2) are blockers for production.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new task worker system, integrating with AWS SQS and EventBridge Scheduler for managing and executing background tasks. Key changes include adding AWS SDK dependencies, defining task structures and executor/scheduler traits, and implementing an AWS-specific task scheduler and SQS worker. A new binary task_worker.rs handles task execution, including a daily cleanup for canceled instances. The API server is updated to ensure the daily cleanup task is scheduled on startup. Review feedback suggests improving performance in the cleanup task by processing instance deletions concurrently and enhancing error handling robustness in the AWS scheduler by matching specific SDK error types instead of string comparisons.

Comment on lines +103 to +121
for instance in cleanup_targets.drain(..) {
if let Err(err) = self.agent_service.delete_instance(instance.id).await {
failed_instances += 1;
tracing::error!(
"cleanup task: delete failed instance_id={} user_id={} status={} err={}",
instance.id,
user_id,
instance.status,
err
);
} else {
tracing::info!(
"cleanup task: deleted instance_id={} user_id={} previous_status={}",
instance.id,
user_id,
instance.status
);
}
}


medium

The current implementation deletes instances for a user sequentially. If a user has many instances, this could be slow. You can improve performance by deleting them concurrently using futures::stream.

You'll need to add use futures::StreamExt; to the top of the file.

                let results = futures::stream::iter(cleanup_targets.into_iter())
                    .map(|instance| {
                        let agent_service = Arc::clone(&self.agent_service);
                        async move {
                            let res = agent_service.delete_instance(instance.id).await;
                            (res, instance)
                        }
                    })
                    .buffer_unordered(10)
                    .collect::<Vec<_>>()
                    .await;

                for (result, instance) in results {
                    match result {
                        Ok(_) => {
                            tracing::info!(
                                "cleanup task: deleted instance_id={} user_id={} previous_status={}",
                                instance.id,
                                user_id,
                                instance.status
                            );
                        }
                        Err(err) => {
                            failed_instances += 1;
                            tracing::error!(
                                "cleanup task: delete failed instance_id={} user_id={} status={} err={}",
                                instance.id,
                                user_id,
                                instance.status,
                                err
                            );
                        }
                    }
                }

Contributor Author


I agree this is a performance concern, but not a correctness issue. Given our current instance limits (5 max per user), it's unlikely to materially affect normal production users.

Comment on lines +126 to +159
match create.send().await {
Ok(_) => Ok(()),
Err(err) => {
let err_text = err.to_string();
if err_text.contains("ConflictException") || err_text.contains("already exists") {
let mut update = self
.client
.update_schedule()
.name(request.task_id.as_str())
.group_name(&self.scheduler_group)
.schedule_expression(&expression)
.flexible_time_window(flex_window)
.target(target);

if matches!(request.schedule, ScheduleSpec::At(_)) {
update = update.action_after_completion(ActionAfterCompletion::Delete);
}

update.send().await.map_err(|e| {
anyhow!(
"failed to update existing schedule for task_id={}: {:?}",
request.task_id,
e
)
})?;
Ok(())
} else {
Err(anyhow!(
"failed to create schedule for task_id={}: {:?}",
request.task_id,
err
))
}
}


medium

Instead of matching on the string representation of the error, it's more robust to match on the specific error type provided by the AWS SDK. This avoids potential breakages if the error message string changes in a future SDK version.

You can use aws_sdk_scheduler::error::SdkError::ServiceError and match on the CreateScheduleError::ConflictException variant. This same pattern should also be applied to the create_task_if_absent function.

You'll need to add the following imports at the top of the file:

use aws_sdk_scheduler::error::SdkError;
use aws_sdk_scheduler::operation::create_schedule::CreateScheduleError;
        match create.send().await {
            Ok(_) => Ok(()),
            Err(SdkError::ServiceError(service_error))
                if matches!(service_error.err(), CreateScheduleError::ConflictException(_)) =>
            {
                let mut update = self
                    .client
                    .update_schedule()
                    .name(request.task_id.as_str())
                    .group_name(&self.scheduler_group)
                    .schedule_expression(&expression)
                    .flexible_time_window(flex_window)
                    .target(target);

                if matches!(request.schedule, ScheduleSpec::At(_)) {
                    update = update.action_after_completion(ActionAfterCompletion::Delete);
                }

                update.send().await.map_err(|e| {
                    anyhow!(
                        "failed to update existing schedule for task_id={}: {:?}",
                        request.task_id,
                        e
                    )
                })?;
                Ok(())
            }
            Err(err) => Err(anyhow!(
                "failed to create schedule for task_id={}: {:?}",
                request.task_id,
                err
            )),
        }

Contributor Author


fixed

Contributor

Copilot AI left a comment


Pull request overview

Implements a basic background task system using AWS EventBridge Scheduler + SQS, including shared task definitions in services, API-side schedule creation on startup, and a dedicated SQS worker binary to execute tasks.

Changes:

  • Added services::jobs task primitives (IDs, schedules, payloads) plus dispatching helpers.
  • Added AWS scheduler + SQS worker plumbing in the API crate and a task_worker binary.
  • Introduced TaskConfig and env example entries to configure scheduler/worker behavior.

Reviewed changes

Copilot reviewed 9 out of 11 changed files in this pull request and generated 5 comments.

Show a summary per file

| File | Description |
| --- | --- |
| rust-toolchain.toml | Bumps Rust toolchain version. |
| env.example | Documents new scheduler/worker env vars. |
| crates/services/src/lib.rs | Exposes new jobs module. |
| crates/services/src/jobs/mod.rs | Defines task types, scheduling spec, and dispatching. |
| crates/config/src/lib.rs | Adds TaskConfig with env-driven defaults and tests. |
| crates/api/src/tasks/mod.rs | Implements AWS Scheduler integration and SQS worker loop helpers. |
| crates/api/src/main.rs | Ensures daily cleanup schedule at API startup. |
| crates/api/src/lib.rs | Exposes new tasks module. |
| crates/api/src/bin/task_worker.rs | Adds worker executable to process SQS task messages and run cleanup logic. |
| crates/api/Cargo.toml | Adds AWS SDK dependencies. |
| Cargo.lock | Locks AWS SDK + transitive dependency updates. |


Comment on lines +6 to +9
#[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize)]
#[serde(transparent)]
pub struct TaskId(String);


Copilot AI Mar 27, 2026


TaskId has a new() validator, but deriving Deserialize with #[serde(transparent)] allows any string to deserialize into TaskId without validation. Consider enforcing validation during deserialization (e.g., #[serde(try_from = "String")] + TryFrom<String> calling TaskId::new), or validating in the SQS message parser.
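
Routing deserialization through the constructor is straightforward with `TryFrom`. A dependency-free sketch of the idea (the 1–64 ASCII alphanumeric/`-`/`_` rule is an assumed stand-in for whatever `TaskId::new` actually validates):

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TaskId(String);

impl TaskId {
    pub fn new(s: impl Into<String>) -> Result<Self, String> {
        let s = s.into();
        // Assumed rule: 1-64 ASCII alphanumeric, '-' or '_' characters.
        let ok = !s.is_empty()
            && s.len() <= 64
            && s.chars().all(|c| c.is_ascii_alphanumeric() || c == '-' || c == '_');
        if ok { Ok(TaskId(s)) } else { Err(format!("invalid task id: {s:?}")) }
    }

    pub fn as_str(&self) -> &str {
        &self.0
    }
}

// With serde, `#[serde(try_from = "String")]` on the struct would route
// deserialization through this impl, so invalid ids fail at parse time
// instead of slipping through `#[serde(transparent)]`.
impl TryFrom<String> for TaskId {
    type Error = String;
    fn try_from(s: String) -> Result<Self, Self::Error> {
        TaskId::new(s)
    }
}
```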

Contributor Author


fixed

Comment on lines +80 to +84
Err(err) => {
let err_text = err.to_string();
if err_text.contains("ConflictException") || err_text.contains("already exists") {
tracing::info!(
"schedule already exists, skipping create task_id={} group={}",

Copilot AI Mar 27, 2026


create_task_if_absent detects an existing schedule by substring-matching err.to_string() ("ConflictException"/"already exists"). This is brittle across SDK versions and can misclassify errors. Prefer matching on the typed AWS SDK error variants for conflict vs. other failures.

Contributor Author


fixed

Comment on lines +128 to +132
Err(err) => {
let err_text = err.to_string();
if err_text.contains("ConflictException") || err_text.contains("already exists") {
let mut update = self
.client

Copilot AI Mar 27, 2026


upsert_task also relies on substring-matching err.to_string() to detect create conflicts. Consider switching to matching on the concrete AWS SDK error variants (e.g., conflict) so schedule updates don’t depend on error text formatting.

Contributor Author


fixed

Comment on lines +215 to +219
.wait_time_seconds(self.wait_seconds.clamp(1, 20))
.visibility_timeout(self.visibility_timeout.max(1))
.send()
.await
.context("failed to receive SQS messages")?;

Copilot AI Mar 27, 2026


In run_forever, a transient receive_message failure will currently return an error (and terminate the worker) because the call is ?-propagated. For a long-running worker, consider catching/logging the error and retrying with a delay/backoff instead of exiting the loop.

Contributor Author


fixed

Comment on lines +255 to +264
if let Some(receipt_handle) = message.receipt_handle() {
client
.delete_message()
.queue_url(queue_url)
.receipt_handle(receipt_handle)
.send()
.await
.context("failed to delete SQS message")?;
}


Copilot AI Mar 27, 2026


If receipt_handle is missing, the code currently skips deletion and still returns Ok(()), which will cause the message to be redelivered and reprocessed indefinitely. Consider treating a missing receipt_handle as an error (or at least logging and returning an error) so it’s visible and doesn’t silently loop.

Suggested change
if let Some(receipt_handle) = message.receipt_handle() {
client
.delete_message()
.queue_url(queue_url)
.receipt_handle(receipt_handle)
.send()
.await
.context("failed to delete SQS message")?;
}
let receipt_handle = message
.receipt_handle()
.ok_or_else(|| anyhow!("SQS message missing receipt_handle"))?;
client
.delete_message()
.queue_url(queue_url)
.receipt_handle(receipt_handle)
.send()
.await
.context("failed to delete SQS message")?;

Contributor Author


I’m not changing this one in this PR. If receipt_handle is missing, we can’t delete the message either way, so returning an error instead of Ok(()) doesn’t actually change the redelivery behavior. In normal SQS ReceiveMessage responses a receipt handle is expected, so I’m treating this as an edge-case defensive improvement rather than a priority fix.

@danielwpz
Contributor Author

@claude
3: This is a real pagination gap: the cleanup currently only reads the first 1,000 instances for a user. Given our current configured instance limits, it’s unlikely to affect normal production users, so I’m not treating it as a blocker for this fix set. I’d prefer to handle the inner pagination separately in a follow-up change.

Other issues are fixed; please review again.

@claude

claude bot commented Mar 27, 2026

Claude finished @danielwpz's task in 58s


Re-review of PR #229

Tasks

  • Read trigger comment and understand request
  • Review current state of previously flagged issues
  • Verify fixes for issues 1, 2, 4, 5 (issue 3 deferred by author)
  • Post updated review findings

All four previously-blocking issues have been addressed. Here's the status:

Previously Critical — Now Fixed ✅

1. Privacy violation — NoopTaskPayload.note logged at INFO
Fixed. execute_noop now logs only "noop task received" with no payload content (task_worker.rs:19-22).

2. SQS poison-message infinite loop on parse failure
Fixed. parse_sqs_message now returns a ParsedSqsMessage::Drop { reason } variant, and process_message deletes the message before returning when it gets a Drop (tasks/mod.rs:276-310). Malformed messages are now consumed rather than endlessly redelivered.

4. API server startup blocked by AWS scheduler
Fixed. main.rs:44-52 now wraps the call in if let Err(err) = ... and logs a warn! rather than propagating the error with ?. AWS scheduler unavailability no longer blocks the API from starting.

5. Transient SQS errors kill the worker process
Fixed. run_forever now catches receive_message errors, logs a warning, sleeps for SQS_RECEIVE_RETRY_DELAY (5s), and continues the loop (tasks/mod.rs:226-234).


Issue 3 — Pagination gap at 1,000 instances (deferred ✓)

Acknowledged and accepted as a follow-up. The 1,000-instance hard cap (task_worker.rs:72) remains. A warning log when instances.len() == 1000 would be a cheap safety net in the interim, but not blocking.


Remaining Minor Issues

These were flagged as minor in the first review and remain unaddressed:

6. ConflictException detected via string matching (tasks/mod.rs:87, tasks/mod.rs:135)
Still fragile across SDK versions. Both create_task_if_absent and upsert_task check err_text.contains("ConflictException"). The AWS SDK exposes typed variants — this would be a safe improvement but is not a blocker.

7. upsert_task duplicates create_task_if_absent logic
Still duplicated. Not a blocker, but worth a cleanup pass.

8. rust-toolchain.toml toolchain bump bundled in this PR
Still present. Isolating the toolchain bump to a separate PR reduces blast radius if a regression is introduced, but this is a style preference.

9. Missing receipt_handle silently skips deletion (tasks/mod.rs:317-319)
Noted by Copilot and still present: if a message has no receipt_handle, process_message returns Ok(()) after executing the task without deleting the message. In practice EventBridge Scheduler always provides a receipt_handle, so this is low-risk, but returning an error would surface any unexpected cases.


Overall: The PR is in good shape. All critical blockers from the previous review are resolved. The remaining items are minor polish. Ready to merge once you've decided whether items 6–9 are worth a follow-up or addressing inline.

@danielwpz
Contributor Author

@claude

Fixed the duplicate scheduler create/setup logic by pulling the shared request/create path into common helpers.

I’m keeping the rust-toolchain bump in this PR because it is required by the AWS SDK update in this change set.

Also fixed the logging issue, invalid-message handling, startup behavior, and the transient SQS receive retry path in the latest update.

@claude

claude bot commented Mar 30, 2026

Claude finished @danielwpz's task in 1m 15s


Re-review of PR #229

Tasks

  • Read trigger comment and understand request
  • Read current state of tasks/mod.rs to verify typed error matching and helper extraction
  • Read current state of task_worker.rs and services/src/jobs/mod.rs to verify remaining fixes
  • Post final review findings

All previously flagged issues have been resolved. Here's the complete status:

Previously Critical — All Fixed ✅

| # | Issue | Status |
| --- | --- | --- |
| 1 | Privacy: NoopTaskPayload.note logged at INFO | ✅ Fixed — execute_noop logs only "noop task received" (task_worker.rs:19-21) |
| 2 | SQS poison-message infinite loop on parse failure | ✅ Fixed — ParsedSqsMessage::Drop variant; message deleted before returning (tasks/mod.rs:317-326) |
| 4 | API startup blocked by AWS scheduler errors | ✅ Fixed — wrapped in if let Err in main.rs |
| 5 | Transient SQS errors terminate the worker | ✅ Fixed — run_forever catches errors, warns, sleeps SQS_RECEIVE_RETRY_DELAY (5s), continues (tasks/mod.rs:240-249) |

Previously Minor — All Fixed ✅

6. ConflictException now uses typed SDK variants (tasks/mod.rs:32-38)
Excellent. The new is_conflict_error helper correctly matches on SdkError::ServiceError and CreateScheduleError::ConflictException(_) — no more fragile string matching. Imports for SdkError and CreateScheduleError are at the top of the file.

7. Duplicate create/update logic extracted (tasks/mod.rs:20-124)
Clean refactor. ScheduleRequestParts, CreateScheduleOutcome, and try_create_schedule are now shared between create_task_if_absent and upsert_task, eliminating the duplication.

Copilot: TaskId deserialization bypassed validation (services/src/jobs/mod.rs:32-40)
Fixed. TaskId now has a hand-written Deserialize impl that routes through TaskId::new, enforcing the 1-64 char alphanumeric validation at parse time. Test at line 184 covers the error case.

Intentionally Deferred (Acknowledged)

  • Issue 3 — Pagination gap at 1,000 instances (task_worker.rs:72): Deferred to follow-up. Author's reasoning (max 5 instances per user in practice) is reasonable.
  • Issue 8 — Toolchain bump bundled in PR: Author confirmed the 1.91.1 bump is required by the AWS SDK dependency. Accepted.
  • Issue 9 — Missing receipt_handle returns Ok(()) (tasks/mod.rs:315, 332): Author's reasoning stands — if receipt_handle is absent, deletion would fail regardless of return value, so Ok(()) vs Err doesn't change the redelivery outcome. Low risk given SQS always provides one.

Overall: All critical and minor issues from previous reviews are resolved. The code is clean, well-tested, and the architectural decisions are sound. This PR is ready to merge.

@danielwpz
Contributor Author

@gemini-code-assist review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a background task system leveraging AWS SQS and EventBridge Scheduler. It adds a new task_worker binary, a task execution framework within the services crate, and a specific job for cleaning up agent instances linked to canceled subscriptions. Additionally, the Rust toolchain is updated to version 1.91.1. Feedback was provided concerning the list_instances call in the cleanup task, which currently lacks pagination and could leave orphaned resources if a user exceeds the 1000-instance limit.

Comment on lines +72 to +83
let (instances, _) = match self.agent_service.list_instances(user_id, 1000, 0).await
{
Ok(result) => result,
Err(err) => {
tracing::error!(
"cleanup task: failed to list instances user_id={} err={}",
user_id,
err
);
continue;
}
};


high

The current implementation fetches only up to 1000 instances per user due to the hardcoded limit and lack of pagination. If a user has more than 1000 instances, any additional instances will not be cleaned up, leading to orphaned resources. To ensure all instances are processed, you should paginate through the results of list_instances.

                let instances = {
                    let mut all_instances = Vec::new();
                    let mut offset = 0;
                    const LIMIT: i32 = 1000;
                    loop {
                        match self.agent_service.list_instances(user_id, LIMIT, offset).await {
                            Ok((batch, total)) => {
                                let fetched_count = batch.len();
                                all_instances.extend(batch);
                                if all_instances.len() >= total as usize
                                    || fetched_count < LIMIT as usize
                                {
                                    break Ok(all_instances);
                                }
                                offset += fetched_count as i32;
                            }
                            Err(err) => {
                                tracing::error!(
                                    "cleanup task: failed to list instances user_id={} err={}",
                                    user_id,
                                    err
                                );
                                break Err(());
                            }
                        }
                    }
                };

                let instances = match instances {
                    Ok(i) => i,
                    Err(_) => continue,
                };

@danielwpz
Contributor Author

@gemini-code-assist

This won’t happen in practice with the current limits. Each user can have at most 5 instances right now, so the 1,000 cap is well outside normal operation.

@gemini-code-assist

Thank you for the update, @danielwpz. I have reviewed the latest changes, including the refactored scheduler logic and the addressed minor issues. The implementation of the task scheduling and worker service is now robust and well-structured. Given that the pagination gap is understood and deferred, and the toolchain bump is necessary for the SDK update, this PR is in excellent shape. I have no further concerns.
